Contents
1. Introduction
The hotel industry is a crucial part of the global economy, with a significant contribution to many countries’ economies. In this report, we aim to provide insights into the booking demand in the hotel industry by analyzing a Kaggle dataset.
The purpose of this analysis is to identify patterns in the data that can be useful in making informed decisions for hotel owners and managers. By exploring trends in booking demand, we can gain valuable insights into customer behavior and preferences, which can ultimately lead to better management of resources and increased revenue.
In the following sections, we will describe the data sources, methods, and results of our analysis, with a focus on providing a clear and comprehensive narrative that is easy to follow and understand. By the end of this report, we hope to have demonstrated the value of data-driven decision-making in the hotel industry and provided actionable insights that can be applied in real-world settings.
2. Data Sources
The data used for this analysis is sourced from the “Hotel Booking Demand” dataset available on Kaggle. This dataset contains information on hotel bookings made by customers. The data was originally sourced from two hotels, one located in Portugal and the other in Tunisia.
The dataset consists of:
hotel_bookings.csv: This file contains data on bookings made from July 2015 to August 2017. It contains 32 columns and 119,390 rows. The columns in the dataset include information on the customer, the booking, and details about the stay. Here is a table summarizing the columns and their descriptions in the hotel_bookings.csv file:
Click to show/hide code
#create a table with column name and description using kableExtra package
# Read hotel booking data from CSV file
hotel_data <- read.csv("hotel_bookings.csv", stringsAsFactors = FALSE)
# Create a vector of column names
column_names <- colnames(hotel_data)
# Create a vector of column descriptions
description <- c("Unique identifier for the booking",
"ID of the hotel where the booking was made",
"ID of the hotel type",
"Number of occupants",
"Date of arrival",
"Date of departure",
"Number of days between arrival and departure",
"Number of previous bookings by the customer",
"Number of previous bookings for the specific hotel",
"Number of previous bookings for the same customer in the same hotel",
"Indication if the booking was canceled (1) or not (0)",
"Type of booking",
"Average daily rate as defined by dividing the sum of all lodging transactions by the total number of staying nights",
"Indication of whether the customer made a deposit to guarantee the booking",
"Date on which the deposit was made",
"Number of customer complaints",
"ID of the country of origin",
"Indication of whether the booking is for a resort hotel (1) or not (0)",
"Indication of whether the booking is for a city hotel (1) or not (0)",
"Number of adults",
"Number of children",
"Number of babies",
"Booking type",
"Source of the booking",
"Market segment designation",
"Booking distribution channel",
"Name of the group if booking is associated with a group",
"Indication if the booking was canceled (1) or not (0)",
"Date at which the status was set to 'Canceled'",
"Indication of whether the booking was canceled before the arrival date (1) or not (0)",
"Indication of the final status of the booking",
"Date at which the final status was set")
library(kableExtra)
# Create a data frame for column names and descriptions
desc_df <- data.frame(column_names, description)
# Use kable function to format the table
kable(desc_df,
col.names = c("Column Name", "Description"),
align = "c") %>%
kable_styling(full_width = F,
position = "center",
font_size = 14)| Column Name | Description |
|---|---|
| hotel | Unique identifier for the booking |
| is_canceled | ID of the hotel where the booking was made |
| lead_time | ID of the hotel type |
| arrival_date_year | Number of occupants |
| arrival_date_month | Date of arrival |
| arrival_date_week_number | Date of departure |
| arrival_date_day_of_month | Number of days between arrival and departure |
| stays_in_weekend_nights | Number of previous bookings by the customer |
| stays_in_week_nights | Number of previous bookings for the specific hotel |
| adults | Number of previous bookings for the same customer in the same hotel |
| children | Indication if the booking was canceled (1) or not (0) |
| babies | Type of booking |
| meal | Average daily rate as defined by dividing the sum of all lodging transactions by the total number of staying nights |
| country | Indication of whether the customer made a deposit to guarantee the booking |
| market_segment | Date on which the deposit was made |
| distribution_channel | Number of customer complaints |
| is_repeated_guest | ID of the country of origin |
| previous_cancellations | Indication of whether the booking is for a resort hotel (1) or not (0) |
| previous_bookings_not_canceled | Indication of whether the booking is for a city hotel (1) or not (0) |
| reserved_room_type | Number of adults |
| assigned_room_type | Number of children |
| booking_changes | Number of babies |
| deposit_type | Booking type |
| agent | Source of the booking |
| company | Market segment designation |
| days_in_waiting_list | Booking distribution channel |
| customer_type | Name of the group if booking is associated with a group |
| adr | Indication if the booking was canceled (1) or not (0) |
| required_car_parking_spaces | Date at which the status was set to ‘Canceled’ |
| total_of_special_requests | Indication of whether the booking was canceled before the arrival date (1) or not (0) |
| reservation_status | Indication of the final status of the booking |
| reservation_status_date | Date at which the final status was set |
| Column Name | Description |
|---|---|
| hotel | Unique identifier for the booking |
| is_canceled | ID of the hotel where the booking was made |
| lead_time | ID of the hotel type |
| arrival_date_year | Number of occupants |
| arrival_date_month | Date of arrival |
| arrival_date_week_number | Date of departure |
| arrival_date_day_of_month | Number of days between arrival and departure |
| stays_in_weekend_nights | Number of previous bookings by the customer |
| stays_in_week_nights | Number of previous bookings for the specific hotel |
| adults | Number of previous bookings for the same customer in the same hotel |
| children | Indication if the booking was canceled (1) or not (0) |
| babies | Type of booking |
| meal | Average daily rate as defined by dividing the sum of all lodging transactions by the total number of staying nights |
| country | Indication of whether the customer made a deposit to guarantee the booking |
| market_segment | Date on which the deposit was made |
| distribution_channel | Number of customer complaints |
| is_repeated_guest | ID of the country of origin |
| previous_cancellations | Indication of whether the booking is for a resort hotel (1) or not (0) |
| previous_bookings_not_canceled | Indication of whether the booking is for a city hotel (1) or not (0) |
| reserved_room_type | Number of adults |
| assigned_room_type | Number of children |
| booking_changes | Number of babies |
| deposit_type | Booking type |
| agent | Source of the booking |
| company | Market segment designation |
| days_in_waiting_list | Booking distribution channel |
| customer_type | Name of the group if booking is associated with a group |
| adr | Indication if the booking was canceled (1) or not (0) |
| required_car_parking_spaces | Date at which the status was set to ‘Canceled’ |
| total_of_special_requests | Indication of whether the booking was canceled before the arrival date (1) or not (0) |
| reservation_status | Indication of the final status of the booking |
| reservation_status_date | Date at which the final status was set |
3. Data Loading
To begin our analysis, we first load the dataset, “hotel_bookings.csv” into R.
# Load data_hotelset .csv as data_hotel
data_hotel <- read.csv("hotel_bookings.csv")
# We can take a quick look at the data by calling the head() function.
head(data_hotel)Click to show/hide output
## hotel is_canceled lead_time arrival_date_year arrival_date_month
## 1 Resort Hotel 0 342 2015 July
## 2 Resort Hotel 0 737 2015 July
## 3 Resort Hotel 0 7 2015 July
## 4 Resort Hotel 0 13 2015 July
## 5 Resort Hotel 0 14 2015 July
## 6 Resort Hotel 0 14 2015 July
## arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights
## 1 27 1 0
## 2 27 1 0
## 3 27 1 0
## 4 27 1 0
## 5 27 1 0
## 6 27 1 0
## stays_in_week_nights adults children babies meal country market_segment
## 1 0 2 0 0 BB PRT Direct
## 2 0 2 0 0 BB PRT Direct
## 3 1 1 0 0 BB GBR Direct
## 4 1 1 0 0 BB GBR Corporate
## 5 2 2 0 0 BB GBR Online TA
## 6 2 2 0 0 BB GBR Online TA
## distribution_channel is_repeated_guest previous_cancellations
## 1 Direct 0 0
## 2 Direct 0 0
## 3 Direct 0 0
## 4 Corporate 0 0
## 5 TA/TO 0 0
## 6 TA/TO 0 0
## previous_bookings_not_canceled reserved_room_type assigned_room_type
## 1 0 C C
## 2 0 C C
## 3 0 A C
## 4 0 A A
## 5 0 A A
## 6 0 A A
## booking_changes deposit_type agent company days_in_waiting_list customer_type
## 1 3 No Deposit NULL NULL 0 Transient
## 2 4 No Deposit NULL NULL 0 Transient
## 3 0 No Deposit NULL NULL 0 Transient
## 4 0 No Deposit 304 NULL 0 Transient
## 5 0 No Deposit 240 NULL 0 Transient
## 6 0 No Deposit 240 NULL 0 Transient
## adr required_car_parking_spaces total_of_special_requests reservation_status
## 1 0 0 0 Check-Out
## 2 0 0 0 Check-Out
## 3 75 0 0 Check-Out
## 4 75 0 0 Check-Out
## 5 98 0 1 Check-Out
## 6 98 0 1 Check-Out
## reservation_status_date
## 1 2015-07-01
## 2 2015-07-01
## 3 2015-07-02
## 4 2015-07-02
## 5 2015-07-03
## 6 2015-07-03
To get a better understanding of the structure of the dataset, we can use the str() function. This function provides information about the structure of the dataset, such as the number of observations, the number of variables, and the type of data in each variable.
str(data_hotel)Click to show/hide output
## 'data.frame': 119390 obs. of 32 variables:
## $ hotel : chr "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
## $ is_canceled : int 0 0 0 0 0 0 0 0 1 1 ...
## $ lead_time : int 342 737 7 13 14 14 0 9 85 75 ...
## $ arrival_date_year : int 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
## $ arrival_date_month : chr "July" "July" "July" "July" ...
## $ arrival_date_week_number : int 27 27 27 27 27 27 27 27 27 27 ...
## $ arrival_date_day_of_month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ stays_in_weekend_nights : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stays_in_week_nights : int 0 0 1 1 2 2 2 2 3 3 ...
## $ adults : int 2 2 1 1 2 2 2 2 2 2 ...
## $ children : int 0 0 0 0 0 0 0 0 0 0 ...
## $ babies : int 0 0 0 0 0 0 0 0 0 0 ...
## $ meal : chr "BB" "BB" "BB" "BB" ...
## $ country : chr "PRT" "PRT" "GBR" "GBR" ...
## $ market_segment : chr "Direct" "Direct" "Direct" "Corporate" ...
## $ distribution_channel : chr "Direct" "Direct" "Direct" "Corporate" ...
## $ is_repeated_guest : int 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_cancellations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_bookings_not_canceled: int 0 0 0 0 0 0 0 0 0 0 ...
## $ reserved_room_type : chr "C" "C" "A" "A" ...
## $ assigned_room_type : chr "C" "C" "C" "A" ...
## $ booking_changes : int 3 4 0 0 0 0 0 0 0 0 ...
## $ deposit_type : chr "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
## $ agent : chr "NULL" "NULL" "NULL" "304" ...
## $ company : chr "NULL" "NULL" "NULL" "NULL" ...
## $ days_in_waiting_list : int 0 0 0 0 0 0 0 0 0 0 ...
## $ customer_type : chr "Transient" "Transient" "Transient" "Transient" ...
## $ adr : num 0 0 75 75 98 ...
## $ required_car_parking_spaces : int 0 0 0 0 0 0 0 0 0 0 ...
## $ total_of_special_requests : int 0 0 0 0 1 1 0 1 1 0 ...
## $ reservation_status : chr "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
## $ reservation_status_date : chr "2015-07-01" "2015-07-01" "2015-07-02" "2015-07-02" ...
Another useful function for getting a quick overview of the data is the summary() function. This function provides summary statistics for each variable in the dataset.
summary(data_hotel)Click to show/hide output
## hotel is_canceled lead_time arrival_date_year
## Length:119390 Min. :0.0000 Min. : 0 Min. :2015
## Class :character 1st Qu.:0.0000 1st Qu.: 18 1st Qu.:2016
## Mode :character Median :0.0000 Median : 69 Median :2016
## Mean :0.3704 Mean :104 Mean :2016
## 3rd Qu.:1.0000 3rd Qu.:160 3rd Qu.:2017
## Max. :1.0000 Max. :737 Max. :2017
##
## arrival_date_month arrival_date_week_number arrival_date_day_of_month
## Length:119390 Min. : 1.00 Min. : 1.0
## Class :character 1st Qu.:16.00 1st Qu.: 8.0
## Mode :character Median :28.00 Median :16.0
## Mean :27.17 Mean :15.8
## 3rd Qu.:38.00 3rd Qu.:23.0
## Max. :53.00 Max. :31.0
##
## stays_in_weekend_nights stays_in_week_nights adults
## Min. : 0.0000 Min. : 0.0 Min. : 0.000
## 1st Qu.: 0.0000 1st Qu.: 1.0 1st Qu.: 2.000
## Median : 1.0000 Median : 2.0 Median : 2.000
## Mean : 0.9276 Mean : 2.5 Mean : 1.856
## 3rd Qu.: 2.0000 3rd Qu.: 3.0 3rd Qu.: 2.000
## Max. :19.0000 Max. :50.0 Max. :55.000
##
## children babies meal country
## Min. : 0.0000 Min. : 0.000000 Length:119390 Length:119390
## 1st Qu.: 0.0000 1st Qu.: 0.000000 Class :character Class :character
## Median : 0.0000 Median : 0.000000 Mode :character Mode :character
## Mean : 0.1039 Mean : 0.007949
## 3rd Qu.: 0.0000 3rd Qu.: 0.000000
## Max. :10.0000 Max. :10.000000
## NA's :4
## market_segment distribution_channel is_repeated_guest
## Length:119390 Length:119390 Min. :0.00000
## Class :character Class :character 1st Qu.:0.00000
## Mode :character Mode :character Median :0.00000
## Mean :0.03191
## 3rd Qu.:0.00000
## Max. :1.00000
##
## previous_cancellations previous_bookings_not_canceled reserved_room_type
## Min. : 0.00000 Min. : 0.0000 Length:119390
## 1st Qu.: 0.00000 1st Qu.: 0.0000 Class :character
## Median : 0.00000 Median : 0.0000 Mode :character
## Mean : 0.08712 Mean : 0.1371
## 3rd Qu.: 0.00000 3rd Qu.: 0.0000
## Max. :26.00000 Max. :72.0000
##
## assigned_room_type booking_changes deposit_type agent
## Length:119390 Min. : 0.0000 Length:119390 Length:119390
## Class :character 1st Qu.: 0.0000 Class :character Class :character
## Mode :character Median : 0.0000 Mode :character Mode :character
## Mean : 0.2211
## 3rd Qu.: 0.0000
## Max. :21.0000
##
## company days_in_waiting_list customer_type adr
## Length:119390 Min. : 0.000 Length:119390 Min. : -6.38
## Class :character 1st Qu.: 0.000 Class :character 1st Qu.: 69.29
## Mode :character Median : 0.000 Mode :character Median : 94.58
## Mean : 2.321 Mean : 101.83
## 3rd Qu.: 0.000 3rd Qu.: 126.00
## Max. :391.000 Max. :5400.00
##
## required_car_parking_spaces total_of_special_requests reservation_status
## Min. :0.00000 Min. :0.0000 Length:119390
## 1st Qu.:0.00000 1st Qu.:0.0000 Class :character
## Median :0.00000 Median :0.0000 Mode :character
## Mean :0.06252 Mean :0.5714
## 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :8.00000 Max. :5.0000
##
## reservation_status_date
## Length:119390
## Class :character
## Mode :character
##
##
##
##
We can also check for missing values in the dataset. If the missing values are few and negligible, we can ignore them.
sum(is.na(data_hotel))## [1] 4
In our case, we have only 4 missing values, which we can ignore as insignificant.
4. Data Subset
The original dataset has 119,390 rows. As the dataset size is too large, we will select a random subset of 10% of the size of the original dataset to perform our analysis. To ensure reproducibility, we set a random seed.
# Calculate the size of the dataset
dataset_size <- nrow(data_hotel)
# Set the random seed to ensure reproducibility
set.seed(123)
# Select a random subset of 10% of the size of the original dataset
subset_size <- round(0.1 * dataset_size)
hotel_subset <- data_hotel[sample(dataset_size, subset_size), ]
# Display number of rows in hotel_subset
nrow(hotel_subset)## [1] 11939
Now our dataset has 11,939 rows.
5. Obtaining Geographic Location Data for Hotel Subsets
As we continue with the analysis of hotel booking demand, we realized that the hotel_subset dataset we had created earlier lacks geolocation data. To address this, we obtained a new dataset from Github and called it country_loc that contains the geographic coordinates of various countries. We loaded the country_loc dataset using the read.csv() function and viewed the data using the head() function.
country_loc <- read.csv("countries_codes_and_coordinates.csv")
head(country_loc)## Country Alpha.2.code Alpha.3.code Numeric.code Latitude..average.
## 1 Afghanistan AF AFG 4 33.0000
## 2 Albania AL ALB 8 41.0000
## 3 Algeria DZ DZA 12 28.0000
## 4 American Samoa AS ASM 16 -14.3333
## 5 Andorra AD AND 20 42.5000
## 6 Angola AO AGO 24 -12.5000
## Longitude..average.
## 1 65.0
## 2 20.0
## 3 3.0
## 4 -170.0
## 5 1.6
## 6 18.5
6. Data Cleaning
We previously found out while inspecting the data_hotel variable that the arrival date exists as 3 separate columns, namely arrival_date_year, arrival_date_year, arrival_date_month, and arrival_date_day_of_month. We start by combining the year, month, and day columns into a single date column to simplify the data format. We can do this using the as.Date function along with paste.
# New combined arrival_date column
hotel_subset$arrival_date <- as.Date(with(hotel_subset,paste(arrival_date_year,arrival_date_month,arrival_date_day_of_month,sep="-")),format = "%Y-%B-%d")
head(hotel_subset$arrival_date)## [1] "2016-05-24" "2016-10-02" "2015-11-09" "2016-11-06" "2016-08-14"
## [6] "2016-12-07"
We then discovered that the country column in the hotel_subset dataset and the Alpha.3.code column in the country_loc dataset have different lengths. To investigate further, we used the length function to check the number of unique values in each column, and found that they differ.
length(unique(hotel_subset$country))## [1] 113
length(unique(country_loc$Alpha.3.code))## [1] 244
Upon checking if the column names match, we found that there are no matching values between the two columns.
intersect(x = hotel_subset$country,y = country_loc$Alpha.3.code)## character(0)
To further investigate the differences, we examined the values in the Alpha.3.code column and discovered that they have leading spaces. To get rid of these spaces, we use the str_remove function from the stringr library.
# Investigate Alpha.3.code column values
head(country_loc$Alpha.3.code)## [1] " AFG" " ALB" " DZA" " ASM" " AND" " AGO"
# Remove leading spaces
library(stringr)
country_loc$Alpha.3.code <- str_remove(string= country_loc$Alpha.3.code,pattern = "^ +")
head(country_loc$Alpha.3.code)## [1] "AFG" "ALB" "DZA" "ASM" "AND" "AGO"
7. Combining Datasets for Geolocation data
After removing the leading spaces, we checked if the values in the two columns intersect, and found that they do. Thus, we merged the two datasets using the merge function, joining them on the country column from hotel_subset and the Alpha.3.code column from country_loc.
merged_loc <- merge(hotel_subset, country_loc, by.x = "country", by.y = "Alpha.3.code")Finally, we verified that the resulting dataset, merged_loc, has geolocation per country without any extra missing values generated.
sum(is.na(merged_loc))## [1] 0
8. Data Preparation and Exploration
After merging the two datasets, we need to prepare and explore the data further. We create a data frame df that contains two columns: “index” containing indices of columns in the data frame merged_loc and “colnames” containing names of columns in merged_loc. This will help us in indexing the columns later.
df <- data.frame(index = 1:length(colnames(merged_loc)),
colnames = colnames(merged_loc))
head(df)## index colnames
## 1 1 country
## 2 2 hotel
## 3 3 is_canceled
## 4 4 lead_time
## 5 5 arrival_date_year
## 6 6 arrival_date_month
We create a new data frame cor_hotel by filtering out the columns of interest from the merged_loc data frame using the indexes.
cor_hotel <- merged_loc[c(3,4,9,10,11,12,13,17,18,19,22,26,28,29,30)]
head(cor_hotel)Click to show/hide output
## is_canceled lead_time stays_in_weekend_nights stays_in_week_nights adults
## 1 1 28 4 9 1
## 2 0 1 2 4 2
## 3 0 0 0 2 2
## 4 0 4 2 10 2
## 5 1 188 7 17 2
## 6 1 9 6 17 1
## children babies is_repeated_guest previous_cancellations
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## previous_bookings_not_canceled booking_changes days_in_waiting_list adr
## 1 0 0 0 172.15
## 2 0 0 0 90.33
## 3 0 1 0 136.00
## 4 0 0 0 99.33
## 5 0 0 0 90.95
## 6 0 0 0 68.23
## required_car_parking_spaces total_of_special_requests
## 1 0 0
## 2 0 1
## 3 0 1
## 4 0 0
## 5 0 1
## 6 0 1
We confirm that all the variables are now non-character type by using the str function.
str(cor_hotel)## 'data.frame': 11838 obs. of 15 variables:
## $ is_canceled : int 1 0 0 0 1 1 0 1 1 1 ...
## $ lead_time : int 28 1 0 4 188 9 29 239 73 7 ...
## $ stays_in_weekend_nights : int 4 2 0 2 7 6 0 0 4 3 ...
## $ stays_in_week_nights : int 9 4 2 10 17 17 3 1 10 5 ...
## $ adults : int 1 2 2 2 2 1 1 2 3 2 ...
## $ children : int 0 0 0 0 0 0 0 2 0 2 ...
## $ babies : int 0 0 0 0 0 0 0 0 0 0 ...
## $ is_repeated_guest : int 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_cancellations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_bookings_not_canceled: int 0 0 0 0 0 0 0 0 0 0 ...
## $ booking_changes : int 0 0 1 0 0 0 0 0 0 0 ...
## $ days_in_waiting_list : int 0 0 0 0 0 0 0 0 0 0 ...
## $ adr : num 172.2 90.3 136 99.3 91 ...
## $ required_car_parking_spaces : int 0 0 0 0 0 0 0 0 0 0 ...
## $ total_of_special_requests : int 0 1 1 0 1 1 0 1 0 0 ...
9. Data visualization
9.1 Correlation heatmap
We calculate the correlation matrix using the cor function. We then use the ggplot2 and reshape2 libraries to create a heatmap with column names on both axes.
We first melt the correlation matrix into a long format using the melt function, and then set the column names for the melted dataframe. Finally, we plot the heatmap using the ggplot2 package, using geom_tile to fill each cell, scale_fill_gradient2 to set the color gradient, and theme and labs to add appropriate labeling.
# Calculate the correlation matrix
corr_hotel <- cor(cor_hotel)
# Create a heatmap with column names on both axes
library(ggplot2)
library(reshape2)
# Melt the correlation matrix into long format
corr_melted <- melt(corr_hotel)
# Set column names for melted data frame
colnames(corr_melted) <- c("Var1", "Var2", "Corr")
# Plot the heatmap with ggplot2
ggplot(data = corr_melted, aes(x = Var1, y = Var2, fill = Corr)) +
geom_tile() +
scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0, limit = c(-1,1)) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(x = "", y = "", fill = "Correlation")The resulting heatmap provides insights into the relationship between different variables in the dataset. The color gradient helps visualize the strength and direction of the correlation, with red indicating a positive correlation and blue indicating a negative correlation. The diagonal line shows the correlation of each variable with itself, which is always 1.
There appears to be significant positive correlation between adr and children variables. Previous_bookings_not_cancelled and is_repeat_guest also have positive correlation. Total_of_special_guests and is_cancelled are negatively correlated.
9.2 Line chart: Seasonality of hotel bookings
In this section, we will investigate the seasonality of hotel bookings. We will use a line chart to visualize the trend in the number of bookings over time.
First, we load the required packages: dplyr and lubridate. We will use the dplyr package to aggregate data and lubridate to extract year and month from the arrival date.
library(dplyr)
library(lubridate)
options(dplyr.summarise.inform = FALSE)Then, we aggregate the data by counting the number of bookings each month for both years (2015 and 2016). We store the aggregated data in a new data frame called “hotel_data_agg”.
hotel_data_agg <- merged_loc %>%
group_by(year = year(merged_loc$arrival_date), month = month(merged_loc$arrival_date)) %>%
summarize(bookings = n())Finally, we create a line chart using ggplot2. We use the x-axis to represent the months (from 1 to 12) and the y-axis to represent the number of bookings. We add both a line and point to the chart. We also add labels to the x-axis, y-axis, and title to make the chart more informative.
ggplot(hotel_data_agg, aes(x = month, y = bookings)) +
geom_line() +
geom_point(color = "red") +
xlab("Months \nYear- 2015 to 2016") +
ylab("Number of Bookings") +
ggtitle("Number of Bookings over Time")The line chart shows that there is a gradual increase in the number of bookings from January to July, with a steep drop in August. After that, the number of bookings gradually increases again until it reaches its peak in October, followed by a decline in December.
9.3 Histogram - Frequency distribution of lead time data
The lead_time column is extracted from the dataset and is used to create a histogram that shows the frequency distribution of lead times. The hist() function is used to create the histogram.
# Load ggplot2 library
library(ggplot2)
# Extract lead time column
lead_time <- merged_loc$lead_time
# Create ggplot histogram of lead times
ggplot(merged_loc, aes(x = lead_time)) +
geom_histogram(binwidth = 30, color = "black", fill = "lightblue") +
labs(title = "Histogram of Lead Times",
x = "Lead Time (days)",
y = "Count")The lead times histogram shows that the majority of bookings in the dataset have a short lead time, with the highest peak occurring at around 0-30 days. The plot also shows a long tail towards the right, indicating that there are also bookings with longer lead times.
9.4 Bar Chart: Booking Cancellation
In order to examine the distribution of booking cancellations in the dataset, a barchart is generated using ggplot2 package. The count of cancelled and not-cancelled bookings is represented in the barchart.
library(ggplot2)
# add a new column to identify cancelled or not
merged_loc$cancelled <- ifelse(merged_loc$is_canceled == 1, "cancelled", "not cancelled")
# plot barchart with count of cancelled and not-cancelled bookings
ggplot(merged_loc, aes(x = cancelled, fill = cancelled)) +
geom_bar() +
scale_fill_manual(values = c("lightblue", "darkblue"),
labels = c("Cancelled", "Not Cancelled")) +
labs(title = "Count of Cancelled and Not-cancelled Bookings",
x = "Cancellation Status",
y = "Count")The chart clearly shows that there are more not-cancelled bookings than cancelled bookings in the dataset.
9.5 Bar Chart: Compare Average Daily Rate (ADR) between city and resort hotels.
Firstly, the mean ADR for each hotel type is calculated using the aggregate() function. The resulting data is stored in the mean_adr_by_hotel_type object.
A bar plot is then created using barplot(), where the ADR values are plotted on the y-axis, and the hotel types are displayed on the x-axis. The x-axis labels are specified using the names.arg argument, and the x-axis and y-axis labels are set using the xlab and ylab arguments. Finally, the main title of the plot is set using the main argument.
# Calculate the mean adr for each hotel type
mean_adr_by_hotel_type <- aggregate(adr ~ hotel, data = merged_loc, mean)
# Create a ggplot bar plot
ggplot(mean_adr_by_hotel_type, aes(x = hotel, y = adr, fill = hotel)) +
geom_bar(stat = "identity", color = "black") +
scale_fill_manual(values = c("darkblue", "lightblue")) +
labs(title = "Comparison of Average Daily Rate (ADR) between City and Resort Hotels",
x = "Hotel Type", y = "Average Daily Rate (ADR)")The bar plot conclusively shows that the average daily rate of the city hotels is higher than that of the resort hotels.
9.6 Interactive World Map: Distribution of hotel bookings and lead times across countries
This code uses the ggplot2 and ggmap libraries to create an interactive world map with plotted hotel data points, which can be used to visually explore the geographical distribution of hotel bookings and lead times across countries.
First, the code loads the required libraries.
library(ggplot2)
library(ggmap)
library(ggiraph)Next, the map_data() function is used to create data for world coordinates.
world_coordinates <- map_data("world")The ggplot() function is then used to create the world map with geom_map(). This function takes the world coordinates as input to plot the map. The color, fill, and size parameters are used to specify the color and size of the points.
# create an aggregated dataset by Country
merged_loc_agg <- merged_loc %>%
group_by(Country) %>%
summarise(mean_lead_time = round(mean(lead_time)))
p <- ggplot() +
# geom_map() function takes world coordinates as input to plot world map
geom_map(
data = world_coordinates, map = world_coordinates,
aes(long, lat, map_id = region),
color = "white", fill = "lightblue", size = 0.2
) +
# geom_point() adds points for each hotel subset by country with lead time
geom_point_interactive(
data = merged_loc,
aes(y = Latitude..average.,x = Longitude..average., color = "red",
size=lead_time, tooltip=paste("Country: ", Country,
"<br>Mean Lead Time: ", merged_loc_agg$mean_lead_time[match(Country, merged_loc_agg$Country)],
" days", sep="")),
alpha = 0.5
) +
# remove the legend
theme(legend.position="none")Finally, the ggiraph() function is used to create an interactive plot by passing the print() output of the ggplot() object as a code argument.
ggiraph(code = print(p))10. Modelling
10.1 Two-sample t-test: Average Daily Rate (ADR) between City and Resort hotel
The code below performs a two-sample t-test to compare the Average Daily Rate (ADR) between City and Resort hotel. The first step in performing a t-test is to create two subsets of the ADR data: one for the City hotel and one for the Resort hotel. This is achieved using the subset() function in R, which extracts the ADR data for the City and Resort hotels separately using the hotel column.
city_adr <- subset(merged_loc, hotel == "City Hotel")$adr
resort_adr <- subset(merged_loc, hotel == "Resort Hotel")$adrNext, the t.test() function is used to perform a two-sample t-test, which calculates the difference between the means of the two subsets of data and determines whether this difference is statistically significant. The results of the t-test are stored in the t_test_result variable.
t_test_result <- t.test(city_adr, resort_adr)Finally, the results of the t-test are printed to the console using the variable t_test_result.
t_test_result##
## Welch Two Sample t-test
##
## data: city_adr and resort_adr
## t = 9.2521, df = 5528.7, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 7.92565 12.18734
## sample estimates:
## mean of x mean of y
## 105.73628 95.67978
The Welch Two Sample t-test was used to compare the means of two independent groups represented by city_adr and resort_adr variables. The p-value is very small, indicating that the null hypothesis of equal means can be rejected in favor of the alternative hypothesis that the two groups are not the same. The 95% confidence interval suggests that the true difference in means lies between 7.93 and 12.19, and the sample means indicate that the city_adr group has a significantly higher mean value than the resort_adr group. Overall, there is a statistically significant difference in the mean values of the two groups.
10.2 Linear Regression - ADR and Children + Stays_in_week_nights
The number of children in a booking could potentially affect the ADR, as families with children may require additional accommodations or services that could drive up the price. To investigate this relationship, we fit a linear regression model between ADR and Children, as well as stays in weeknights, using the merged_loc dataset.
#Fit a linear regression model between ADR and Children and stays in weeknights
model <- lm(adr ~ children + stays_in_week_nights, data = merged_loc)
#Print model summary
summary(model)##
## Call:
## lm(formula = adr ~ children + stays_in_week_nights, data = merged_loc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -177.003 -30.682 -5.682 24.318 274.497
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 94.0546 0.6988 134.593 < 2e-16 ***
## children 40.6605 1.0267 39.603 < 2e-16 ***
## stays_in_week_nights 1.6272 0.2247 7.243 4.66e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 45.36 on 11835 degrees of freedom
## Multiple R-squared: 0.1224, Adjusted R-squared: 0.1223
## F-statistic: 825.7 on 2 and 11835 DF, p-value: < 2.2e-16
The coefficients table shows that both the intercept and slope are significant. The R-squared values indicate that the model explains 11.83% of the variation in adr, and adding the children variable did not improve the fit significantly. The F-statistic shows that the children variable is a significant predictor of adr. The regression model suggests that for each additional child, adr increases by approximately 41.03 units. Therefore, the number of children is a significant predictor of adr.
Next, we focus on the relationship between ADR and Children in the hotel_subset dataset using a scatter plot and linear regression model. To achieve this, we use the ggplot2 and ggpubr libraries to plot a scatter plot and overlay a linear regression line.
#Load required libraries
library(ggplot2)
library(ggpubr)
#Fit linear regression model
lm_model <- lm(adr ~ children, data = hotel_subset)
#Plot linear regression model
ggscatter(hotel_subset, x = "children", y = "adr",
add = "reg.line", conf.int = TRUE,
cor.coef = TRUE, cor.method = "pearson",
xlab = "Children", ylab = "ADR",
title = "Linear Regression Model: ADR vs. Children")The scatter plot with the linear regression line confirms the output of the model summary.
11. Chapter of Choice - Animation of Visualization and Improvement of Visual Appeal of the Report
In this section, we’ll improve our previous visualization - ‘Aggregate Booking demand of hotels from 2015 to 2016’ using the ggplot2 package in R.
Process
We’ll start by loading the necessary packages and using the base code from the original line plot.
library(dplyr)
library(lubridate)
library(ggplot2)
library(ggthemes)
library(gganimate)# Aggregate data by year and month
hotel_data_agg <- hotel_subset %>%
group_by(year = year(arrival_date), month = month(arrival_date)) %>%
summarize(bookings = n())
# Plot the line graph and create animation frames
p <- ggplot(hotel_data_agg, aes(x = month, y = bookings)) +
geom_line() +
geom_point(color = "red", size = 5) +
xlab("Months \nYear- 2015 to 2016") +
ylab("Number of Bookings") +
ggtitle("Number of Bookings over Time") +
transition_reveal(month) +
theme_economist() In the code above, we create a line graph with ggplot2, where the x-axis represents the month and the y-axis represents the number of bookings. We also use the geom_line() function to draw the line, geom_point() to add the red points representing the bookings, and ggtitle() to add a title to the graph. We then add an animation to the graph using gganimate’s transition_reveal() function, which creates a smooth transition between frames, and apply a modern theme with theme_economist() from ggthemes package.
Render animation as a GIF
We can now render the graph as a GIF to see the booking demand over time.
library(gganimate)
animate(p, nframes = 50, fps = 10, renderer = gifski_renderer())The animated graph aggregate moving trend in booking demand from January 2015 to August 2016. The number of bookings peaked in May and then again around September.
Improvement of Visual Appeal of the Report
The visual appeal of a report plays a crucial role in engaging readers and communicating insights effectively.
We used the prettydoc R package to enhance the visual appeal of our report. With the YAML header output: prettydoc::html_pretty, we selected the cayman theme, which has a modern and sleek design. Additionally, we specified the highlight style to be github, which adds syntax highlighting to our code chunks. These small changes have a big impact on the readability and aesthetics of the final report.
#title: "Analysis of Hotel Booking Demand"
#author: "Samuel Devdas & RituRaj Singh"
#date: ""
#output:
#prettydoc::html_pretty:
#theme: cayman
#highlight: github
#warning: FALSEThe warning: FALSE line in the YAML header of the R Markdown document is used to suppress the display of warning messages during the knitting of the document. This can be useful in cases where warnings are not relevant to the analysis being presented or when warnings clutter the output and make it more difficult to read.
12. Conclusion
In conclusion, we have identified several patterns in the data that can be used to inform decision-making by hotel owners and managers.
Our analysis revealed that there is a positive correlation between adr and the number of children, as well as a positive correlation between previous bookings not cancelled and being a repeat guest. On the other hand, there is a negative correlation between the number of special guests and the likelihood of cancellation.
We also found that there is a seasonal pattern in booking demand, with a peak in October and a decline in December. Furthermore, the lead times for bookings are generally short, with the majority of bookings made within 30 days of arrival.
In terms of hotel type, our analysis showed that the average daily rate for city hotels is higher than that of resort hotels, and this difference was found to be statistically significant. Finally, we found that the number of children is a significant predictor of adr.
Based on these findings, we recommend that hotel owners and managers consider the impact of the number of children on adr when setting prices. They should also pay close attention to seasonal patterns in booking demand and adjust their pricing and marketing strategies accordingly. Additionally, hotel owners and managers should focus on minimizing cancellations by offering incentives for repeat bookings and ensuring excellent customer service.
Overall, this analysis demonstrates the value of data-driven decision-making in the hotel industry and provides actionable insights that can be applied to improve revenue and customer satisfaction.